
    Rethinking the Pipeline of Demosaicing, Denoising and Super-Resolution

    Incomplete color sampling, noise degradation, and limited resolution are three key problems that are unavoidable in modern camera systems. Demosaicing (DM), denoising (DN), and super-resolution (SR) are the core components of a digital image processing pipeline that address these three problems, respectively. Although each of these problems has been studied actively, the mixture problem of DM, DN, and SR, which is of higher practical value, has received little attention. Such a mixture problem is usually solved by a sequential solution (applying each method independently in a fixed order: DM → DN → SR), or is simply tackled by an end-to-end network without sufficient analysis of the interactions among the tasks, resulting in an undesired drop in final image quality. In this paper, we rethink the mixture problem from a holistic perspective and propose a new image processing pipeline: DN → SR → DM. Extensive experiments show that simply modifying the usual sequential solution to follow our proposed pipeline enhances image quality by a large margin. We further adopt the proposed pipeline in an end-to-end network and present the Trinity Enhancement Network (TENet). Quantitative and qualitative experiments demonstrate the superiority of TENet over the state of the art. In addition, we note that the literature lacks a fully color-sampled dataset. To this end, we contribute a new high-quality, fully color-sampled real-world dataset, namely PixelShift200. Our experiments show the benefit of the proposed PixelShift200 dataset for raw image processing. Comment: Code is available at: https://github.com/guochengqian/TENet
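    The reordering the abstract argues for can be made concrete with a toy sequential pipeline. The sketch below is illustrative only: `denoise`, `super_resolve`, and `demosaic` are hypothetical placeholders (identity, nearest-neighbour upsampling, and channel replication), not the paper's learned modules.

```python
import numpy as np

def denoise(img):
    # Placeholder for a learned or classical denoiser (identity here).
    return img

def super_resolve(img, scale=2):
    # Placeholder SR: nearest-neighbour upsampling standing in for a learned model.
    return img.repeat(scale, axis=0).repeat(scale, axis=1)

def demosaic(raw):
    # Placeholder demosaicing: replicate the mosaic into 3 channels.
    return np.stack([raw, raw, raw], axis=-1)

def usual_pipeline(raw):
    # Conventional order discussed in the abstract: DM -> DN -> SR.
    return super_resolve(denoise(demosaic(raw)))

def proposed_pipeline(raw):
    # Order proposed in the paper: DN -> SR -> DM.
    return demosaic(super_resolve(denoise(raw)))

raw_bayer = np.random.rand(64, 64).astype(np.float32)  # toy single-channel mosaic
print(proposed_pipeline(raw_bayer).shape)  # (128, 128, 3)
```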

    LLM as A Robotic Brain: Unifying Egocentric Memory and Control

    Embodied AI focuses on the study and development of intelligent systems that possess a physical or virtual embodiment (i.e., robots) and are able to dynamically interact with their environment. Memory and control are the two essential parts of an embodied system and usually require separate frameworks to model each of them. In this paper, we propose a novel and generalizable framework called LLM-Brain: using a large-scale language model as a robotic brain to unify egocentric memory and control. The LLM-Brain framework integrates multiple multimodal language models for robotic tasks, utilizing a zero-shot learning approach. All components within LLM-Brain communicate using natural language in closed-loop, multi-round dialogues that encompass perception, planning, control, and memory. At the core of the system is an embodied LLM that maintains egocentric memory and controls the robot. We demonstrate LLM-Brain on two downstream tasks: active exploration and embodied question answering. The active exploration task requires the robot to extensively explore an unknown environment within a limited number of actions, while the embodied question answering task requires the robot to answer questions based on observations acquired during prior exploration.
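    To make the closed-loop, natural-language control idea concrete, here is a minimal sketch of an LLM-in-the-loop exploration agent. `query_llm`, `perceive`, and `execute` are hypothetical stand-ins (not the LLM-Brain API), and the memory is simply a list of natural-language notes fed back into each prompt.

```python
def query_llm(prompt):
    # Placeholder for a call to a large language model (e.g. an API client).
    return "move_forward"

def perceive():
    # Placeholder: convert egocentric observations into a natural-language caption.
    return "a corridor with a door on the left"

def execute(action):
    # Placeholder: dispatch a low-level controller for the chosen action.
    print(f"executing: {action}")

memory = []        # egocentric memory kept as natural language
MAX_STEPS = 5      # limited action budget, as in active exploration

for step in range(MAX_STEPS):
    observation = perceive()
    memory.append(f"step {step}: saw {observation}")
    prompt = (
        "You control a robot exploring an unknown environment.\n"
        "Memory so far:\n" + "\n".join(memory) + "\n"
        f"Current observation: {observation}\n"
        "Reply with a single action (e.g. move_forward, turn_left, stop)."
    )
    action = query_llm(prompt)
    memory.append(f"step {step}: chose {action}")
    execute(action)
```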

    Diffusion Priors for Dynamic View Synthesis from Monocular Videos

    Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguish between motion and structure, particularly in scenarios where camera poses are either unknown or constrained compared to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate unseen regions that are occluded or only partially observed in the given videos. To address these issues, we first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. Subsequently, we distill the knowledge from the finetuned model into a 4D representation encompassing both dynamic and static Neural Radiance Field (NeRF) components. The proposed pipeline achieves geometric consistency while preserving the scene identity. We perform thorough experiments to evaluate the efficacy of the proposed method qualitatively and quantitatively. Our results demonstrate the robustness and utility of our approach in challenging cases, further advancing dynamic novel view synthesis.
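    The distillation step the abstract describes can be sketched, under heavy simplification, as a score-distillation-style loop: render from a learnable 4D representation, perturb it with noise, and use the frozen (finetuned) diffusion model's noise-prediction error as the training signal. `Renderer4D` and `noise_predictor` below are toy stand-ins, not the paper's actual models.

```python
import torch
import torch.nn as nn

class Renderer4D(nn.Module):
    # Toy 4D representation: in the paper this would combine dynamic and static
    # NeRF components; here it is just a learnable RGB-D image per time step.
    def __init__(self, steps=4, size=32):
        super().__init__()
        self.frames = nn.Parameter(torch.rand(steps, 4, size, size))

    def render(self, t):
        return self.frames[t]

# Stand-in for the finetuned RGB-D diffusion model; kept frozen during distillation.
noise_predictor = nn.Conv2d(4, 4, 3, padding=1)
for p in noise_predictor.parameters():
    p.requires_grad_(False)

renderer = Renderer4D()
optimizer = torch.optim.Adam(renderer.parameters(), lr=1e-2)

for _ in range(10):
    t = torch.randint(0, 4, (1,)).item()
    rendered = renderer.render(t).unsqueeze(0)   # novel view at time t
    noise = torch.randn_like(rendered)
    noisy = rendered + noise                     # toy corruption at a fixed noise level
    pred = noise_predictor(noisy)
    loss = ((pred - noise) ** 2).mean()          # distillation-style residual
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```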

    Exploring Open-Vocabulary Semantic Segmentation without Human Labels

    Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models (e.g., CLIP) to train open-vocabulary zero-shot semantic segmentation models. Although these VL models have acquired extensive knowledge of visual concepts, it is non-trivial to exploit that knowledge for semantic segmentation, as they are usually trained at the image level. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. We evaluate ZeroSeg on multiple popular segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO, in a zero-shot manner (i.e., with no training or adaptation on the target segmentation datasets). Our approach achieves state-of-the-art performance compared to other zero-shot segmentation methods trained on the same data, while also performing competitively with strongly supervised methods. Finally, we demonstrate the effectiveness of ZeroSeg on open-vocabulary segmentation through both human studies and qualitative visualizations.
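    A minimal sketch of the distillation idea, assuming a frozen image-level VL encoder as the teacher and a set of learnable segment tokens as the student; the encoder, region handling, and loss below are illustrative placeholders, not ZeroSeg's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen stand-in for a pretrained vision-language image encoder (e.g. CLIP's).
vl_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
for p in vl_encoder.parameters():
    p.requires_grad_(False)

num_segments, dim = 8, 64
segment_tokens = nn.Parameter(torch.randn(num_segments, dim))  # learned summaries of local regions
optimizer = torch.optim.Adam([segment_tokens], lr=1e-2)

image = torch.rand(1, 3, 32, 32)
# Hypothetical region crops: the same image repeated here; in practice each
# token would summarize a different localized region of the image.
regions = image.repeat(num_segments, 1, 1, 1)

for _ in range(10):
    with torch.no_grad():
        teacher = F.normalize(vl_encoder(regions), dim=-1)   # frozen VL features per region
    student = F.normalize(segment_tokens, dim=-1)
    loss = (1 - (teacher * student).sum(dim=-1)).mean()      # cosine-distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```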

    Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

    We present Magic123, a two-stage, coarse-to-fine approach for generating high-quality, textured 3D meshes from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference-view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images. Our code, models, and generated 3D assets are available at https://github.com/guochengqian/Magic123. Comment: Webpage: https://guochengqian.github.io/project/magic123
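    One plausible reading of the "single trade-off parameter" is a scalar that re-weights the 2D-prior guidance loss against the 3D-prior one. The sketch below is an assumption about how such a knob could look; the names are illustrative and not taken from the Magic123 code base.

```python
import torch

def combined_guidance(loss_2d, loss_3d, lambda_2d=1.0):
    """Blend 2D- and 3D-prior losses with a single knob (illustrative).

    Larger `lambda_2d` leans on the 2D prior (more imaginative, exploratory
    geometry); smaller values lean on the 3D prior (more precise, exploitative).
    """
    return lambda_2d * loss_2d + loss_3d

# Toy usage with stand-in loss values:
loss_2d = torch.tensor(0.7, requires_grad=True)
loss_3d = torch.tensor(0.3, requires_grad=True)
total = combined_guidance(loss_2d, loss_3d, lambda_2d=0.5)
total.backward()
```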